Get An Interesting GeoJSON File

Geospatial data can be expressed in maps, which can offer amazing levels of data density.

You’re probably familiar with JSON, which is frequently used to store and pass data between applications. GeoJSON applies JSON structure to geospatial data in a single JSON file. Let’s get data about child blood lead levels from the City of Philadelphia:

library(rgdal)
philly_lead_map <- readOGR('https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+child_blood_lead_levels_by_ct&filename=child_blood_lead_levels_by_ct&format=geojson&skipfields=cartodb_id')
## OGR data source with driver: GeoJSON 
## Source: "https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+child_blood_lead_levels_by_ct&filename=child_blood_lead_levels_by_ct&format=geojson&skipfields=cartodb_id", layer: "OGRGeoJSON"
## with 380 features
## It has 5 fields

Sanity Checks

Let’s take a peek at the tabular data associated with this geospatial object (not the actual map)

head(philly_lead_map@data)
census_tract data_redacted num_bll_5plus num_screen perc_5plus
0 42101000100 0 0 100 0
1 42101000200 1 NA 109 NA
2 42101000300 1 NA 110 NA
3 42101000401 1 NA 61 NA
4 42101000402 0 0 41 0
5 42101000500 1 NA 49 NA
summary(philly_lead_map@data)
census_tract data_redacted num_bll_5plus num_screen perc_5plus
42101000100: 1 Min. :0.0000 Min. : 0.00 Min. : 6.0 Min. : 0.000
42101000200: 1 1st Qu.:0.0000 1st Qu.: 7.00 1st Qu.:112.8 1st Qu.: 3.225
42101000300: 1 Median :0.0000 Median :14.00 Median :204.5 Median : 5.700
42101000401: 1 Mean :0.3395 Mean :17.63 Mean :225.1 Mean : 5.859
42101000402: 1 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:298.8 3rd Qu.: 8.475
42101000500: 1 Max. :1.0000 Max. :81.00 Max. :846.0 Max. :17.600
(Other) :374 NA NA’s :129 NA’s :8 NA’s :126

And let’s take a quick peek at the map, to make sure there are no initial, obvious problems:

library(leaflet)
library(leaflet.extras)

leaflet(philly_lead_map) %>%
  setView(lng = mean(philly_lead_map@bbox['x',], na.rm=TRUE), 
          lat = mean(philly_lead_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
  addPolygons() 

Well, the defaults are ugly, but worse, we see some “holes” in our map. This problem was also evident in the data, which had too few rows.

We’re missing some tracts, because Philly has 384 tracts, and this data only has 380. Clearly some tracts were not surveyed. Let’s get all of Philly’s tracts, since we want to map the entire city, and provide imputation as needed. We can get complete GeoJSON file for all the Census tracts in Philly.

Fill in any Geospatial Holes

full_philadelphia_map <- readOGR('http://data.phl.opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson')
## OGR data source with driver: GeoJSON 
## Source: "http://data.phl.opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson", layer: "OGRGeoJSON"
## with 384 features
## It has 14 fields

What kind of data is in this map?

head(full_philadelphia_map@data)
OBJECTID STATEFP10 COUNTYFP10 TRACTCE10 GEOID10 NAME10 NAMELSAD10 MTFCC10 FUNCSTAT10 ALAND10 AWATER10 INTPTLAT10 INTPTLON10 LOGRECNO
0 1 42 101 009400 42101009400 94 Census Tract 94 G5020 S 366717 0 +39.9632709 -075.2322437 10429
1 2 42 101 009500 42101009500 95 Census Tract 95 G5020 S 319070 0 +39.9658709 -075.2379140 10430
2 3 42 101 009600 42101009600 96 Census Tract 96 G5020 S 405273 0 +39.9655396 -075.2435075 10431
3 4 42 101 013800 42101013800 138 Census Tract 138 G5020 S 341256 0 +39.9764504 -075.1771771 10468
4 5 42 101 013900 42101013900 139 Census Tract 139 G5020 S 562934 0 +39.9750563 -075.1711846 10469
5 6 42 101 014000 42101014000 140 Census Tract 140 G5020 S 439802 0 +39.9735358 -075.1630966 10470

Oh, okay, it has Census Tract data. Is it complete? Let’s check… does our data frame now have 384 rows?

nrow(full_philadelphia_map@data)
## [1] 384

Let’s do a quick mapping sanity check. We’ll change the default color and line width settings this time around:

leaflet(full_philadelphia_map) %>%
  setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE), 
          lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
  addPolygons(
    weight = 1,  # border thickness
    opacity = 0.5, # border opacity
    color = "grey", # border color
    fillColor = "white",
    fillOpacity = 1,
    label = full_philadelphia_map@data$NAMELSAD10
  ) %>%
  suspendScroll()  # Handy to stop those accidental "scrolling by" zooms

This map looks great, and the tabular data might prove useful, too, for things like a “prettier” display of the Census Tract name, instead of just a long number. Let’s combine our lead data into this fuller map.

Merging Tabular Data

WARNING

merge will reorder your rows, and that breaks the relationship between the rows of tabular data and the corresponding polygons. Not what you want! So we’re going to keep track of the order thanks to the helpful “OBJECTID” variable, which is super helpful. If you’re working with data that doesn’t have a column like this, just add it, so you can keep track of what the original order was.

Merging in this case is pretty simple – we just have to bring in the lead data and make sure our “hinge” (overlapping field) is set up properly:

merged <-  merge(x = philly_lead_map@data,
                 y = full_philadelphia_map@data,
                 by.x = "census_tract",
                 by.y = "GEOID10",
                 all = TRUE)

full_philadelphia_map@data <- merged[order(merged$OBJECTID),]

Plot Color-Coded Data (Choropleth)

Let’s see what our lead levels look like without any imputation of missing values:

lead_palette <- colorBin("Blues", domain = full_philadelphia_map@data$perc_5plus, bins = 10, na.color = "#aaaaaa")

leaflet(full_philadelphia_map) %>%
  setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE), 
          lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
  addPolygons(
    weight = 1,  # border thickness
    opacity = 0.5, # border opacity
    color = "grey", # border color
    fillColor = ~lead_palette(full_philadelphia_map@data$perc_5plus),
    fillOpacity = 1,
    label = full_philadelphia_map@data$NAMELSAD10
  ) %>%
  suspendScroll()

Impute Missing Data

Let’s do some intelligent imputation of missing values. First, we identify the neighbors for each tract:

library(spdep)
coords <- coordinates(full_philadelphia_map)
tracts <- row.names(as(full_philadelphia_map, "data.frame"))
knn10 <- knn2nb(knearneigh(coords, k = 10), row.names = tracts)
knn10 <- include.self(knn10)

Impute data for missing rows. Note that we know that the actual number of children testing with high blood lead levels in the case of imputation is either 1, 2, 3, 4, or 5 children, so we know the actual possible percentages, if we know how many children were tested (true except for the “missing” tracts we had to fill in). So we’ll do knn imputation, but if the knn imputation comes in lower than the lowest possible suppressed percent, we’ll just toss in the lowest possible, and do a similar check with the highest possible suppressed percent.

Note that knn averaging might also give us an NA, because there are areas with no nearby tracts that were tested. In that case, just give us whatever the percentage is for 3 children testing positive (the mean and median value of all possibilities).

for (row in 1:nrow(full_philadelphia_map@data)) {
  
  # is this row missing the percentage of children testing with high blood lead?
  if (is.na(full_philadelphia_map@data$perc_5plus[row])) {  
    
    # lowest possible pct (if we know # of kids screened)
    low_pct <- (1/full_philadelphia_map@data$num_screen[row])*100  
    
    # highest possible pct (if we know # of kids screened)
    high_pct <- (5/full_philadelphia_map@data$num_screen[row])*100 
    
    # the "expected value" or median (and mean) number of kiddos with high lead
    median_pct <- (3/full_philadelphia_map@data$num_screen[row])*100 
    
    # take the mean of our neighbors
    knn_pct <- mean(full_philadelphia_map@data$perc_5plus[unlist(knn10[row])], na.rm=TRUE) 
    
    if (is.na(knn_pct)) {
      # no way to determine a good guess by neighbors?  Choose median.
      full_philadelphia_map@data$perc_5plus[row] <- median_pct
    }
    
    if (!is.na(low_pct) & !is.na(knn_pct) & low_pct > knn_pct) { 
      # knn lower than lowest?  Use lowest.
      full_philadelphia_map@data$perc_5plus[row] <- low_pct                  
    }
    
    else if (!is.na(high_pct) & !is.na(knn_pct) & high_pct < knn_pct) {  
      # knn higher than highest? Use highest
      full_philadelphia_map@data$perc_5plus[row] <- high_pct
    }
    
    else {
      # Otherwise, use knn.
      full_philadelphia_map@data$perc_5plus[row] <- knn_pct   
    }
    
  }
}

Re-map that Jawn

lead_palette <- colorBin("Blues", domain = full_philadelphia_map@data$perc_5plus, bins = 10, na.color = "#cccccc")

leaflet(full_philadelphia_map) %>%
  setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE), 
          lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
  addPolygons(
    weight = 1,  # border thickness
    opacity = 0.5, # border opacity
    color = "grey", # border color
    fillColor = ~lead_palette(full_philadelphia_map@data$perc_5plus),
    fillOpacity = 1
  )

Enrich!

And here we want to pull in our local file, which is a simplified version of data from the American Community Survey conducted by the Census Bureau. Let’s see what is contains.

economic_data <- read.csv("../Data/philly_census.csv")
head(economic_data)
census_tract pct_unemployed pct_commute_public_transit median_household_income mean_household_income pct_child_uninsured pct_families_below_poverty_line
4.2101e+10 2.6 33.4 103772 124525 33.9 1.7
4.2101e+10 7.6 22.0 50455 99337 6.5 13.7
4.2101e+10 4.6 11.5 93036 121867 3.3 3.4
4.2101e+10 5.9 21.8 57604 83354 0.0 23.2
4.2101e+10 2.0 14.6 70038 92625 0.0 0.0
4.2101e+10 12.3 20.6 40568 60813 9.1 0.0

This is selected economic characteristics of various census tracts. Let’s combine the data here with our map, and use labels to allow people to understand the data better:

merged_again <- merge(x=full_philadelphia_map@data,
                      y=economic_data,
                      by = "census_tract",
                      all = TRUE)

full_philadelphia_map@data <- merged_again[order(merged_again$OBJECTID),]


str(full_philadelphia_map@data)
## 'data.frame':    384 obs. of  24 variables:
##  $ census_tract                   : Factor w/ 384 levels "42101000100",..: 95 96 97 134 135 136 137 138 139 140 ...
##  $ data_redacted                  : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ num_bll_5plus                  : int  14 11 14 17 14 7 NA 6 NA 13 ...
##  $ num_screen                     : int  306 253 314 157 264 140 125 159 73 204 ...
##  $ perc_5plus                     : num  4.6 4.3 4.5 10.8 5.3 5 4 3.8 5 6.4 ...
##  $ OBJECTID                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ STATEFP10                      : Factor w/ 1 level "42": 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTYFP10                     : Factor w/ 1 level "101": 1 1 1 1 1 1 1 1 1 1 ...
##  $ TRACTCE10                      : Factor w/ 384 levels "000100","000200",..: 95 96 97 134 135 136 137 138 139 140 ...
##  $ NAME10                         : Factor w/ 384 levels "1","10.01","10.02",..: 369 370 371 43 44 46 47 48 49 50 ...
##  $ NAMELSAD10                     : Factor w/ 384 levels "Census Tract 1",..: 369 370 371 43 44 46 47 48 49 50 ...
##  $ MTFCC10                        : Factor w/ 1 level "G5020": 1 1 1 1 1 1 1 1 1 1 ...
##  $ FUNCSTAT10                     : Factor w/ 1 level "S": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ALAND10                        : int  366717 319070 405273 341256 562934 439802 562132 789935 570015 609439 ...
##  $ AWATER10                       : int  0 0 0 0 0 0 0 277434 282808 0 ...
##  $ INTPTLAT10                     : Factor w/ 384 levels "+39.8798897",..: 106 114 113 141 137 132 126 110 120 131 ...
##  $ INTPTLON10                     : Factor w/ 384 levels "-074.9667387",..: 350 362 370 257 234 208 168 129 113 135 ...
##  $ LOGRECNO                       : Factor w/ 384 levels "10335","10336",..: 95 96 97 134 135 136 137 138 139 140 ...
##  $ pct_unemployed                 : num  19.2 12.3 15.2 15.8 13.4 11.7 17.9 1.5 6.1 10.7 ...
##  $ pct_commute_public_transit     : num  49.1 53.2 39.4 19.6 37.8 49 28.9 21.2 21.8 24.9 ...
##  $ median_household_income        : int  18408 27708 24402 28534 14314 24474 19167 85765 64167 54393 ...
##  $ mean_household_income          : int  31068 33932 33784 41352 30200 36123 43450 115164 95890 62740 ...
##  $ pct_child_uninsured            : num  4.6 20 1.2 4.8 1.4 3.7 0 4.4 3.9 9.4 ...
##  $ pct_families_below_poverty_line: num  41.5 32.9 39.1 24.2 47.7 33.4 20.5 0 3.7 28.2 ...

Let’s create some useful labels:

labels <- sprintf(
  "<strong>%s</strong><br/>
  Families Below Poverty Line (%%): %g <br/>
  Children With High Blood Lead Levels (%%): %g",
  full_philadelphia_map@data$NAMELSAD10, 
  full_philadelphia_map@data$pct_families_below_poverty_line,
  full_philadelphia_map@data$perc_5plus
) %>% lapply(htmltools::HTML)
lead_palette <- colorBin("Blues", domain = full_philadelphia_map@data$perc_5plus, bins = 10, na.color = "#cccccc")

leaflet(full_philadelphia_map) %>%
  setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE), 
          lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
  addPolygons(
    weight = 1,  # border thickness
    opacity = 0.5, # border opacity
    color = "grey", # border color
    fillColor = ~lead_palette(full_philadelphia_map@data$perc_5plus),
    fillOpacity = 1,
    label = labels
  )